Novel loss functions for ensemble-based medical image classification

Medical images commonly exhibit multiple abnormalities. Predicting them requires multi-class classifiers whose training and reliability can be affected by a combination of factors such as dataset size, data source, data distribution, and the loss function used to train the deep neural networks. Currently, the cross-entropy loss remains the de facto loss function for training deep learning classifiers. This loss function, however, asserts equal learning from all classes, leading to a bias toward the majority class. Although the choice of the loss function impacts model performance, to the best of our knowledge, no literature exists that performs a comprehensive analysis and selection of an appropriate loss function for the classification task under study. In this work, we benchmark various state-of-the-art loss functions, critically analyze model performance, and propose improved loss functions for a multi-class classification task. We select a pediatric chest X-ray (CXR) dataset that includes images with no abnormality (normal), and those exhibiting manifestations consistent with bacterial and viral pneumonia. We construct prediction-level and model-level ensembles to improve classification performance. Our results show that, compared to the individual models and the state-of-the-art literature, the weighted averaging of the predictions of the top-3 and top-5 model-level ensembles delivered significantly superior classification performance (p < 0.05) in terms of the Matthews correlation coefficient (MCC) metric (0.9068, 95% confidence interval (0.8839, 0.9297)). Finally, we performed localization studies to interpret model behavior and confirm that the individual models and ensembles learned task-specific features and highlighted disease-specific regions of interest. The code is available at https://github.com/sivaramakrishnan-rajaraman/multiloss_ensemble_models.


Introduction
Deep learning (DL) has demonstrated superior performance in natural and medical computer vision tasks. Computer-aided diagnostic tools developed with DL models have been widely used in analyzing medical images including chest X-rays (CXRs) and computed tomography (CT). CXRs have been studied extensively, with models used to predict manifestations of cardiopulmonary diseases such as pneumonia opacities, pneumothorax, cardiomegaly, tuberculosis (TB), lung nodules, and, more recently, COVID-19 [1,2]. [...] pneumonia CXR dataset into normal, bacterial pneumonia, or viral pneumonia categories. Finally, the top-K (K = 3, 5) performing models are used to construct prediction-level and model-level ensembles. The performance of the individual models, prediction-level ensembles, and model-level ensembles is further analyzed for statistical significance. We also performed localization studies to ensure that the individual models and their ensembles learned task-specific features and highlighted the disease-manifested regions of interest (ROIs) in the CXRs.

Datasets
This retrospective study uses the following two datasets: i. Montgomery TB CXRs [19]: This is a publicly available collection of 58 CXRs showing TB-related manifestations, with radiologist readings, and 80 CXRs showing lungs with no findings. The images and their associated lung masks are de-identified and exempted from the National Institutes of Health (NIH) IRB review (OHSRP#5357). We use this collection as an independent test set to evaluate the segmentation model proposed in this study.
ii. Pediatric pneumonia [6]: A set of 4273 CXRs showing lungs infected with bacterial and viral pneumonia and 1583 CXRs showing normal lungs are collected from children of 1 to 5 years of age at the Guangzhou Medical Center in China. The author-defined [6] training set contains 1349, 2538, and 1345 CXRs and the test set contains 234, 242, and 148 CXRs showing normal lungs, bacterial pneumonia, and viral pneumonia manifestations, respectively. The CXRs are acquired as a part of routine clinical care, curated by expert radiologists, and made publicly available with IRB approvals. We use this dataset toward classifying CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations.

Lung segmentation and cropping
As CXR images contain irrelevant regions that do not help in learning classification task-specific features, we segmented the ROI, i.e., the lungs from the CXRs, and used the lung-segmented images for training the classification models. Our review of the literature reveals that U-Net [20] is widely used for segmenting ROIs in natural and medical images. Further, the study of the literature shows that EfficientNet [18] models have achieved superior performance in natural and medical computer vision tasks, as compared to other models, in terms of accuracy, efficiency, and computational complexity. Hence, we used an EfficientNet-B0-based U-Net model [21] to perform pixel-wise segmentation. The EfficientNet-B0-based U-Net model is trained using the CXR collection and their associated lung masks discussed in [17] to minimize the following loss functions: (i) Binary cross-entropy (BCE), (ii) Weighted BCE-Dice [2], (iii) Focal [8], (iv) Tversky [22], and (v) Focal Tversky [23]. We used 10% of the training data for validation with a fixed seed. Each mini-batch of the training data is augmented using random affine transformations such as pixel shifting [-2 +2], horizontal flipping, and rotations [-5 +5] to introduce variability into the training process. The model is trained using an Adam optimizer with an initial learning rate of 1e-3. The learning rate is reduced whenever the validation loss ceased to improve. The model demonstrating the least validation loss is used to predict lung masks of a reduced 512×512 pixel resolution for the CXRs in the Montgomery TB CXR collection. The images are resized using bicubic interpolation from the OpenCV software library. The performance of the segmentation models is evaluated using the following metrics: (i) Segmentation accuracy; (ii) Dice coefficient, and (iii) Intersection over union (IoU). 
We selected the top-3 segmentation models from those that are trained using the aforementioned loss functions based on segmentation accuracy, Dice coefficient, and IoU metrics. The selected models are used to predict the lung masks for the CXRs in the Montgomery CXR collection. These masks are then bitwise-ANDed to produce the final lung mask. The bitwise-AND operation compares the corresponding pixels of the masks predicted by the top-3 performing models. Only if the pixel is 1 in all three masks, i.e., all models assign it to the lung ROI, is the corresponding pixel in the final mask set to 1; otherwise, it is set to 0. The final lung mask is then overlaid on the original CXR image to delineate the lung boundaries, and the bounding box containing the lung pixels is cropped. The resulting lung-cropped image is resized to 512×512 pixel resolution. Then, the cropped CXRs are contrast-enhanced by saturating the top and bottom 1% of all the image pixels, followed by normalizing the pixels to the range [0, 1]. Fig 1 shows the diagram of the segmentation module proposed in this study.
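The mask-combination and cropping steps can be illustrated with a minimal pure-Python sketch. The function names (and_masks, bounding_box, crop) and the tiny 3×4 masks are hypothetical and chosen only for illustration; the study itself operates on 512×512 predicted masks, and its actual implementation may differ.

```python
def and_masks(masks):
    """Pixel-wise AND of equally sized binary masks (lists of rows of 0/1)."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[int(all(m[r][c] == 1 for m in masks)) for c in range(cols)]
            for r in range(rows)]


def bounding_box(mask):
    """Inclusive (top, bottom, left, right) box around pixels set to 1, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]


def crop(image, box):
    """Crop a 2D image (list of rows) to the inclusive bounding box."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Toy masks standing in for the predictions of three segmentation models.
m1 = [[0, 1, 1, 0], [0, 1, 1, 1], [0, 0, 1, 0]]
m2 = [[0, 1, 1, 0], [1, 1, 1, 0], [0, 0, 1, 0]]
m3 = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
image = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]

final_mask = and_masks([m1, m2, m3])
box = bounding_box(final_mask)
cropped = crop(image, box)
```

Because a pixel survives only when every model marks it as lung, the combined mask is never larger than any individual mask, which makes the ensemble conservative about lung boundaries.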

Classification module
The encoder from the trained EfficientNet-B0-based U-Net model is truncated at the 'block5-c_add' layer (TensorFlow Keras naming convention) with feature map dimensions of [16,16,512]. This approach is followed to transfer CXR modality-specific knowledge to improve performance in the current CXR classification task. The truncated model is appended with the following layers: (i) a zero-padding (ZP) layer, (ii) a convolutional layer with 512 filters, each of size 3×3, (iii) a global average pooling (GAP) layer, and (iv) a final dense layer with three neurons and Softmax activation, to classify the pediatric CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. We used the train and test splits published in [6] to compare our model performance with the state-of-the-art (SOTA) literature [6,24]. We allocated 10% of the training data for validation with a fixed seed. The model is trained using a stochastic gradient descent optimizer with an initial learning rate of 1e-3 and momentum of 0.9, to minimize the loss functions discussed in this study. The best-performing model is selected based on the least loss obtained with the validation data. These models are evaluated with the test set, and the performance is recorded in terms of [...]. The top-K (K = 3, 5) models that deliver superior performance with the test set are used to construct the ensembles. We constructed prediction-level and model-level ensembles. At the prediction level, the models' predictions are combined using various ensemble strategies such as majority voting, simple averaging, weighted averaging, and stacking. In a majority voting ensemble, the class predicted by the most models is taken as the final prediction for each CXR. In a simple averaging ensemble, the individual model predictions are averaged to generate the final prediction.
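To make the difference between majority voting and simple averaging concrete, the following is a minimal sketch for a single image. The function names and the probability values are hypothetical and are not taken from the study.

```python
from collections import Counter


def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda k: probs[k])


def majority_vote(model_probs):
    """model_probs: one class-probability list per model, for one image."""
    votes = Counter(argmax(p) for p in model_probs)
    # Ties are broken by first-seen order, an arbitrary choice in this sketch.
    return votes.most_common(1)[0][0]


def simple_average(model_probs):
    """Average the class probabilities across models."""
    n = len(model_probs)
    return [sum(p[k] for p in model_probs) / n
            for k in range(len(model_probs[0]))]


# Three hypothetical models scoring one CXR over three classes.
probs = [[0.45, 0.50, 0.05],
         [0.40, 0.55, 0.05],
         [0.90, 0.05, 0.05]]

voted_class = majority_vote(probs)              # two of three models pick class 1
averaged_class = argmax(simple_average(probs))  # the confident third model dominates
```

In this toy case the two strategies disagree: voting counts only the winning class per model, whereas averaging also accounts for how confident each model is.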
For the weighted averaging ensemble, we propose to optimize the weights so that they minimize the total logarithmic loss and the predicted labels converge to the target labels. We iteratively minimized the logarithmic loss using the Sequential Least-Squares Programming (SLSQP) algorithm [25]. In a stacking ensemble, the predictions are fed into a meta-learner that consists of a single hidden layer with 9 and 15 neurons for the top-3 and top-5 performing models, respectively. The weights of the top-K models are frozen and only the meta-learner is trained to optimally combine the models' predictions. A dense layer with three neurons and Softmax activation is appended to output prediction probabilities. For the model-level ensemble, the top-K models are instantiated with their trained weights and truncated at their deepest convolutional layer. The features from these layers are concatenated and appended with a 1×1 convolutional layer to reduce feature dimensions. This is followed by a GAP layer and a dense layer with three neurons and Softmax activation to classify the CXRs as showing normal lungs, bacterial pneumonia, or viral pneumonia manifestations. The performance of the individual models, prediction-level ensembles, and model-level ensembles is further compared for statistical significance. All the models are trained and evaluated using TensorFlow Keras 2.4 on a Windows system with an Intel Xeon 3.80 GHz CPU, NVIDIA GeForce GTX 1050 Ti GPU, and CUDA dependencies for GPU acceleration. Statistical significance analysis is performed using R software version 4.1.1.
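The weight optimization for the weighted averaging ensemble can be sketched with SciPy's SLSQP solver. This is an illustration under stated assumptions, not the study's implementation: the constraints (weights in [0, 1] that sum to 1), the uniform starting point, and the synthetic predictions generated by the hypothetical noisy_model helper are all assumptions made here.

```python
import numpy as np
from scipy.optimize import minimize


def log_loss(y_true, probs, eps=1e-12):
    """Mean categorical log loss for one-hot targets."""
    probs = np.clip(probs, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(probs), axis=1))


def optimal_weights(model_probs, y_true):
    """model_probs: (n_models, n_samples, n_classes); y_true: one-hot labels."""
    n_models = model_probs.shape[0]

    def objective(w):
        # Weighted average of the models' predicted probabilities.
        blended = np.tensordot(w, model_probs, axes=1)
        return log_loss(y_true, blended)

    result = minimize(
        objective,
        x0=np.full(n_models, 1.0 / n_models),      # start from simple averaging
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,            # assumed: non-negative weights
        constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
    )
    return result.x


# Synthetic stand-in for three models of decreasing quality.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
y_true = np.eye(3)[labels]


def noisy_model(strength):
    logits = strength * y_true + rng.normal(size=y_true.shape)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


model_probs = np.stack([noisy_model(3.0), noisy_model(1.0), noisy_model(0.2)])
weights = optimal_weights(model_probs, y_true)
```

Starting from uniform weights means the optimizer begins at the simple averaging ensemble and moves away from it only where doing so lowers the logarithmic loss.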

Classification losses
We experimented with the following loss functions to provide a comprehensive evaluation of their impact on the multi-class classification task under study: (i) Categorical cross-entropy (CCE) loss; (ii) Categorical focal loss [8]; (iii) Kullback-Leibler (KL) divergence loss [26]; (iv) Categorical Hinge loss [27]; (v) Label-smoothed CCE loss [28]; (vi) Label-smoothed categorical focal loss [28], and (vii) Calibrated CCE loss [29]. We also propose several loss functions, as follows, that mitigate the issues with the existing loss functions when applied to the multi-class classification task under study: (i) CCE loss with entropy-based regularization; (ii) Calibrated negative entropy loss, (iii) Calibrated KL divergence loss; (iv) Calibrated categorical focal loss, and (v) Calibrated categorical Hinge loss. The details of the proposed loss functions are discussed below.
(i) CCE with entropy-based regularization. DL models demonstrate low entropy values for the output distributions when they are confident about their predictions [29]. However, under class-imbalanced training conditions, the models might be overconfident about the majority class and classify most of the samples as belonging to this dominant class. This may lead to model overfitting and adversely impact generalization performance. Under these circumstances, a penalty could be introduced in the form of a regularization term that penalizes peaked distributions, thereby reducing overfitting and improving generalization. A model produces a conditional distribution $p_\theta(y|x)$ through the Softmax function, over a set of classes $y$ given an input $x$. The entropy of this conditional distribution is given by,
$$H(p_\theta(y|x)) = -\sum_{i} p_\theta(y_i|x)\,\log p_\theta(y_i|x) \tag{1}$$
Here, $H$ denotes the entropy term. A regularization term is proposed where the negative entropy is added to the negative log-likelihood to penalize over-confident output distributions.

PLOS ONE
It is given by,
$$L(\theta) = -\sum \log p_\theta(y|x) - \beta\,H(p_\theta(y|x)) \tag{2}$$
Here, β controls the intensity of the penalty. Through empirical evaluations, we set the value of β = 2. We used this regularization term in the final dense layer as an activity regularizer and trained the model to minimize the CCE loss.
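The effect of the entropy penalty can be checked numerically with a small per-sample sketch: the negative log-likelihood minus β times the entropy of the predicted distribution. The function names and the two example distributions are hypothetical. The study applies the term as a Keras activity regularizer on the final dense layer, which this standalone sketch does not reproduce.

```python
import math


def entropy(probs, eps=1e-12):
    """Shannon entropy of a discrete distribution (natural log)."""
    return -sum(p * math.log(p + eps) for p in probs)


def nll(probs, true_class, eps=1e-12):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[true_class] + eps)


def entropy_regularized_loss(probs, true_class, beta=2.0):
    """NLL with the negative entropy added, weighted by beta."""
    return nll(probs, true_class) - beta * entropy(probs)


# Both predictions are correct (class 0), but one is far more peaked.
peaked = [0.98, 0.01, 0.01]
softer = [0.80, 0.10, 0.10]

loss_peaked = entropy_regularized_loss(peaked, 0)
loss_softer = entropy_regularized_loss(softer, 0)
```

With β = 2, the softer prediction has the lower regularized loss even though its plain negative log-likelihood is higher, which is the intended penalty on peaked distributions.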
(ii) Calibrated negative entropy loss. We propose an entropy-based loss function where the negative entropy is added as an auxiliary term to the negative log-likelihood term, as shown in Eqs [1] and [2], to penalize over-confident output distributions. A model is said to demonstrate poor calibration if it is overconfident or underconfident about its predictions, so that its predicted probabilities do not reflect the true occurrence likelihood of the class events. Motivated by [29], we propose to add to the entropy-based loss function a regularization term that computes the difference between the accuracy and the predicted probabilities. This regularization term helps to penalize the model when the entropy-based loss function reduces without a corresponding change in the accuracy. The regularization term forces the accuracy to match the average predicted probabilities, thereby (i) acting as a smoothing parameter that smoothens overconfident or underconfident predictions and (ii) pushing the model to converge to the ideal condition when the accuracy would reflect the true occurrence likelihood. The calibrated negative entropy loss is given by,
$$L(\theta) = -\sum \log p_\theta(y|x) - \beta\,H(p_\theta(y|x)) + \lambda \cdot \text{difference} \tag{3}$$
Here, β controls the penalty intensity and λ weights the auxiliary term. The auxiliary term difference is calculated for each mini-batch, as given by,
$$\text{difference} = \left| \frac{1}{N}\sum_{k=1}^{N} c_k - \frac{1}{N}\sum_{k=1}^{N} p_\theta(y'_k|x_k) \right| \tag{4}$$
Here, $N$ denotes the number of samples in the mini-batch and $y'_k$ denotes the predicted label. The value of $c_k$ is 1 if $y'_k = y_k$; otherwise, $c_k$ is 0. This auxiliary term forces the average value of the predicted probabilities to match the accuracy over all training examples. This pushes the model closer to the ideal situation, where the model accuracy would reflect the true occurrence likelihood of the samples. The auxiliary term serves as a smoothing parameter for predictions with extremely low or high prediction confidences. We tested with different weights for β = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 2] and λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, we set the value of β = 0.001 and λ = 10.
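The auxiliary difference term can be sketched for one mini-batch as the absolute gap between the batch accuracy and the mean confidence of the predicted class. The function name and the batch values are hypothetical. A training implementation would compute this inside the deep learning framework alongside the base loss, which is not shown here.

```python
def calibration_difference(batch_probs, true_classes):
    """|accuracy - mean predicted-class probability| over one mini-batch."""
    n = len(batch_probs)
    # Predicted label = class with the highest probability.
    predicted = [max(range(len(p)), key=lambda k: p[k]) for p in batch_probs]
    accuracy = sum(int(y_hat == y)
                   for y_hat, y in zip(predicted, true_classes)) / n
    mean_confidence = sum(p[y_hat]
                          for p, y_hat in zip(batch_probs, predicted)) / n
    return abs(accuracy - mean_confidence)


# A hypothetical over-confident batch: every sample is predicted as class 0.
batch = [[0.9, 0.05, 0.05],
         [0.8, 0.10, 0.10],
         [0.7, 0.20, 0.10],
         [0.6, 0.30, 0.10]]
truth = [0, 0, 1, 2]

diff = calibration_difference(batch, truth)  # accuracy 0.5 vs confidence 0.75
```

Here the model is right half of the time but reports an average confidence of 0.75, so the term contributes a penalty of 0.25 scaled by λ.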
(iii) Calibrated KL divergence loss. The KL divergence, also called relative entropy, measures the difference between the observed and actual probability distributions. The KL divergence between two distributions A(x) and B(x) is given by,
$$D_{KL}(A \,\|\, B) = \sum_{x} A(x)\,\log\frac{A(x)}{B(x)} \tag{5}$$
We propose to benefit from the regularization term mentioned in Eq [4] to smoothen model predictions when trained to minimize the KL divergence loss. We propose the calibrated KL divergence loss where the regularization term in Eq [4] is added to the KL divergence loss. This is done to penalize the model when the KL divergence loss reduces without a corresponding change in the accuracy. The calibrated KL divergence loss is given by,
$$\text{Calibrated KL divergence loss} = D_{KL}(A \,\|\, B) + \lambda \cdot \text{difference} \tag{6}$$
The auxiliary term difference is calculated for each mini-batch and is given by Eq [4]. We tested with different weights for λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, the value of λ is set to 1.
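A short sketch of the KL divergence helps show two of its properties: it is asymmetric, and with a one-hot target it reduces to the cross-entropy of the true class. The function name and the example distributions are hypothetical.

```python
import math


def kl_divergence(a, b, eps=1e-12):
    """KL(A || B) for discrete distributions; terms with A(x) = 0 contribute 0."""
    return sum(ai * math.log((ai + eps) / (bi + eps))
               for ai, bi in zip(a, b) if ai > 0)


one_hot = [0.0, 1.0, 0.0]   # ground truth: class 1
pred = [0.2, 0.7, 0.1]      # model output
soft = [0.5, 0.4, 0.1]      # another distribution, for the asymmetry check

kl_one_hot = kl_divergence(one_hot, pred)   # equals -log(0.7)
kl_forward = kl_divergence(soft, pred)
kl_reverse = kl_divergence(pred, soft)
```

The reduction to cross-entropy for one-hot targets suggests that any difference between training with the KL divergence loss and with the CCE loss arises mainly when the targets are not one-hot.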
(iv) Calibrated categorical focal loss. The principal limitation of CCE loss is that the loss asserts equal learning from all the classes. This adversely impacts training and classification performance during class-imbalanced training. This holds for medical images, particularly CXRs, where a class imbalance exists between the majority normal class and other minority disease classes. In this regard, the authors of [8] proposed the focal loss for object detection tasks, in which the standard cross-entropy loss function is modified to down-weight the majority class so that the model would focus on learning the minority classes. In a multi-class classification setting, the categorical focal loss is given by,
$$\text{Categorical focal loss}(p) = -\sum_{k=0}^{K-1} y_k\,(1 - p_k)^{\gamma}\,\log(p_k) \tag{7}$$
Here, K = 3 denotes the number of classes, $k \in \{0, 1, \ldots, K-1\}$ denotes the class labels for the bacterial pneumonia, normal, and viral pneumonia classes, respectively, $y_k$ is the one-hot ground-truth indicator for class $k$, and $p = (p_0, p_1, p_2)$ is a vector representing an estimated probability distribution over the three classes. The value γ denotes the rate at which the easy samples are down-weighted. The categorical focal loss converges to CCE loss at γ = 0. We propose the calibrated categorical focal loss, where the difference between the accuracy and predicted probabilities is added as a regularization term to penalize the model for overconfident and underconfident predictions when trained to minimize the categorical focal loss. The calibrated categorical focal loss is given by,
$$\text{Calibrated categorical focal loss}(p) = -\sum_{k=0}^{K-1} y_k\,(1 - p_k)^{\gamma}\,\log(p_k) + \lambda \cdot \text{difference} \tag{8}$$
The auxiliary term difference is calculated for each mini-batch and is given by Eq [4]. We tested with different weights for γ = [0.5, 1, 2, 5] and λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, the values of γ and λ are both set to 1.
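The down-weighting behavior of the focal term can be sketched per sample as follows. This is an assumed form without a class-balancing factor, using a one-hot target. The function names and probability values are hypothetical.

```python
import math


def cce(y_true, probs, eps=1e-12):
    """Categorical cross-entropy for a one-hot target."""
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, probs))


def categorical_focal_loss(y_true, probs, gamma=1.0, eps=1e-12):
    """Cross-entropy scaled by (1 - p)^gamma, so easy samples count less."""
    return -sum(y * (1.0 - p) ** gamma * math.log(p + eps)
                for y, p in zip(y_true, probs))


target = [1.0, 0.0, 0.0]
easy = [0.9, 0.05, 0.05]   # confidently correct
hard = [0.3, 0.40, 0.30]   # true class receives low probability

# Fraction of the CCE loss that survives the focal modulation at gamma = 2.
easy_ratio = categorical_focal_loss(target, easy, gamma=2.0) / cce(target, easy)
hard_ratio = categorical_focal_loss(target, hard, gamma=2.0) / cce(target, hard)
```

At γ = 2 the easy sample keeps about 1% of its cross-entropy loss and the hard sample about 49%, so gradient updates are dominated by the harder examples.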
(v) Calibrated categorical Hinge loss. The Hinge loss is widely used in binary classification problems to produce "maximum-margin" classification [27], particularly with SVM classifiers. This loss could be used in a multi-class classification setting and is given by,
$$\text{Categorical Hinge loss} = \max(0,\ \text{neg} - \text{pos} + 1) \tag{9}$$
$$\text{neg} = \max\big((1 - y_{true})\, y_{pred}\big) \tag{10}$$
$$\text{pos} = \sum y_{true}\, y_{pred} \tag{11}$$
Here, $y_{true}$ and $y_{pred}$ denote the ground truth one-hot encoded labels and predictions, respectively. We propose the calibrated categorical Hinge loss, where the difference between the accuracy and predicted probabilities is added as an auxiliary term to the categorical Hinge loss. This auxiliary term penalizes the model when the categorical Hinge loss reduces without a corresponding change in the accuracy. The calibrated categorical Hinge loss is given by,
$$\text{Calibrated categorical Hinge loss} = \max(0,\ \text{neg} - \text{pos} + 1) + \lambda \cdot \text{difference} \tag{12}$$
The negative and positive terms are given by Eqs [10] and [11]. The auxiliary term difference is calculated for each mini-batch and is given by Eq [4]. We tested with different weights for λ = [0.5, 1, 2, 5, 10, 15, 20]. After empirical evaluations, the value of λ is set to 10.
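The positive and negative terms of the categorical Hinge loss can be sketched for a single sample as follows. The sketch follows the definition used by the Keras categorical hinge loss, which is assumed here to match the study's formulation. The example predictions are hypothetical.

```python
def categorical_hinge(y_true, y_pred):
    """max(0, neg - pos + 1) for a one-hot target and a prediction vector."""
    # pos: score assigned to the true class.
    pos = sum(t * p for t, p in zip(y_true, y_pred))
    # neg: highest score among the wrong classes.
    neg = max((1.0 - t) * p for t, p in zip(y_true, y_pred))
    return max(0.0, neg - pos + 1.0)


target = [0.0, 1.0, 0.0]

loss_correct = categorical_hinge(target, [0.1, 0.8, 0.1])  # margin not yet met
loss_wrong = categorical_hinge(target, [0.6, 0.3, 0.1])    # wrong class wins
loss_ideal = categorical_hinge(target, [0.0, 1.0, 0.0])    # full margin
```

The loss reaches zero only when the true-class score exceeds the best wrong-class score by the full margin of 1, so even a correct prediction with 0.8 confidence is still penalized.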

CXR lung segmentation
Recall that an EfficientNet-B0-based U-Net model is trained to minimize BCE, weighted BCE-Dice, focal, Tversky, and focal Tversky loss functions and predict lung masks for the CXRs in the Montgomery TB CXR collection. The lung masks predicted by the top-3 performing models are bitwise-ANDed to produce the final lung mask. The performance of the individual models and the bitwise ANDed model ensemble is evaluated using segmentation accuracy, IoU, and Dice coefficient as shown in Table 1. We observed that the segmentation model demonstrated higher values for the Dice coefficient compared to the IoU metrics due to the way the two functions are defined. The Dice coefficient value is given by twice the area of the intersection of two masks, divided by the sum of the areas of the masks. It is observed from Table 1 that, considering individual models, the segmentation model trained to minimize the focal Tversky loss demonstrated superior performance in terms of IoU, Dice coefficient, and accuracy metrics, followed by those trained with Tversky and weighted BCE-Dice losses. These top-3 performing models are used to construct the ensemble. Here, the lung masks predicted by the top-3 performing models are bitwise-ANDed to produce the final lung mask. We observed that the IoU, Dice coefficient, and accuracy, achieved using the bitwise-ANDed model ensemble are superior compared to any individual constituent model. However, we observed no statistically significant difference in performance (p > 0.05) between the individual models and the ensemble. We used the top-3 performing models and the bitwise-ANDed ensemble approach to predict lung masks for the CXRs in the pediatric pneumonia CXR collection. As the ground truth lung masks for these CXRs are not made available by the authors of [6], the segmentation performance could not be validated. The predicted lung masks are overlaid on the original CXRs to delineate the lung boundaries and are cropped. 
The cropped images are resized to 512×512 pixel resolution and used for further analysis (i.e., disease classification).
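The observation that the Dice coefficient is higher than the IoU follows directly from the two definitions, as a small sketch on flattened binary masks shows. The function name and the toy masks are hypothetical, and the sketch assumes that at least one mask is non-empty.

```python
def dice_and_iou(pred, truth):
    """Dice and IoU for two flattened binary masks (sequences of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)       # sum of the two mask areas
    union = total - intersection
    dice = 2 * intersection / total
    iou = intersection / union
    return dice, iou


pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]

dice, iou = dice_and_iou(pred, truth)    # overlap of 2 pixels, areas of 3 each
```

The two metrics are related by Dice = 2·IoU / (1 + IoU), so the Dice coefficient is never lower than the IoU for the same pair of masks, and the two are equal only at 0 and 1.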

CXR disease classification
Recall that the encoder from the trained EfficientNet-B0-based U-Net model is truncated and appended with classification layers. This approach is followed to perform a CXR modality-specific knowledge transfer [2,15,16,30] to improve performance in the relevant task of classifying the CXRs in the pediatric pneumonia CXR collection into normal, bacterial pneumonia, or viral pneumonia categories. The classification models are trained to minimize the existing and proposed loss functions in this study. Table 2 summarizes the classification performance achieved by these models. We measured the 95% CI as the exact Clopper-Pearson interval for [...]. The top-3 models (i.e., those trained to minimize the calibrated CCE, CCE with entropy-based regularization, and calibrated negative entropy losses) and the top-5 models (i.e., those trained to minimize the calibrated CCE, CCE with entropy-based regularization, calibrated negative entropy, label-smoothed categorical focal, and calibrated categorical Hinge losses) are used to construct prediction-level and model-level ensembles. Recall that for the prediction-level ensemble, the models' predictions are combined using majority voting, simple averaging, weighted averaging, and stacking-based ensemble methods. Table 3 summarizes the classification performance achieved by the prediction-level ensembles.
It is observed from Table 3 that the prediction-level ensembles constructed using the top-3 and top-5 performing models demonstrated higher values for F-score as compared to the MCC metrics for the reasons discussed before. The weighted averaging ensemble of the top-5 performing models using the optimal weights [0.40560531, 0.192276399, 0.00356809023, 0.3985502, 1.10927275e-16] calculated using the SLSQP method achieved superior performance compared to other ensembles. The 95% CI obtained using the MCC metric demonstrated a tighter error margin and hence higher precision compared to other ensemble methods. However, we observed no statistically significant difference (p > 0.05) in performance across the ensemble methods. Fig 4 shows the confusion matrix, AUROC, and AUPRC curves achieved using the top-5 weighted averaging ensemble. Recall that the model-level ensembles are constructed using the top-K (K = 3, 5) models by instantiating them with their trained weights and truncating them at their deepest convolutional layers. The feature maps from these layers are concatenated and appended with a 1×1 convolutional layer for feature dimensionality reduction. In our study, the feature maps of the deepest convolutional layers for the models have [16,16,512] dimensions. Hence, after concatenation, the feature maps for the top-3 models are of [16,16,1536] dimensions, and that for the top-5 models are of [16,16,2560] dimensions. We used 1×1 convolutions to reduce these dimensions to [16,16,512]. The 1×1 convolutional layer is appended with a GAP and dense layer with three neurons to classify the CXRs into normal, bacterial pneumonia, or viral pneumonia categories. Table 4 shows the classification performance achieved in this regard. We observed no statistically significant difference (p > 0.05) in performance between the top-3 and top-5 model-level ensembles. We further performed a weighted averaging of the predictions of the top-3 and top-5 model-level ensembles. 
We calculated the optimal weights [0.3764, 0.6236] using the SLSQP method to improve performance. Fig 5 shows the confusion matrix, AUROC, and AUPRC curves obtained by the weighted averaging ensemble using the predictions of the top-3 and top-5 model-level ensembles. We observed that this ensemble approach demonstrated superior performance for all metrics compared to the individual models and all ensemble methods discussed in this study. Table 5 shows a comparison of the performance achieved with (i) the weighted averaging ensemble of top-3 and top-5 model-level predictions and (ii) SOTA literature.
The authors of [6], who released the pediatric pneumonia CXR dataset, performed binary classification to classify the CXRs as showing normal lungs or other abnormal manifestations. To the best of our knowledge, only the authors of [24] performed a multi-class classification using the train and test splits released by the authors of [6]. We observed that the MCC metric achieved by the weighted averaging ensemble of top-3 and top-5 model-level predictions is significantly superior (p < 0.05) compared to the MCC metric reported in the literature [24].
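Since the comparison rests on the MCC metric, a minimal sketch of the multi-class MCC computed from a confusion matrix may be useful. It uses the standard multi-class generalization of the binary formula. The function name and the small confusion matrices are hypothetical toy values, not results from this study.

```python
import math


def multiclass_mcc(cm):
    """cm[i][j]: number of samples of true class i predicted as class j."""
    k = len(cm)
    s = sum(sum(row) for row in cm)                            # total samples
    c = sum(cm[i][i] for i in range(k))                        # correct predictions
    t = [sum(cm[i]) for i in range(k)]                         # true count per class
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]    # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = math.sqrt((s * s - sum(pk * pk for pk in p)) *
                            (s * s - sum(tk * tk for tk in t)))
    return numerator / denominator if denominator else 0.0


toy_cm = [[8, 1, 1],
          [2, 7, 1],
          [0, 2, 8]]
binary_cm = [[3, 1],
             [2, 4]]

mcc_toy = multiclass_mcc(toy_cm)
mcc_binary = multiclass_mcc(binary_cm)   # agrees with the binary MCC formula
```

Unlike accuracy or the F-score, the MCC uses every cell of the confusion matrix, which is one reason it is regarded as informative under class imbalance.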
Disease ROI localization. We used Grad-CAM tools [32] for localizing the disease-manifested ROIs to ensure that the models learned meaningful features. Fig 6 shows

Discussion and conclusions
While several studies [33, 34] report using the pediatric pneumonia CXR dataset [6] in a binary classification setting, only the authors of [24] trained models for a multi-class classification task. Further, the studies in [33, 34] used ImageNet-pretrained models to transfer knowledge to a target CXR classification task as opposed to a CXR modality-specific pretrained model. Such transfer of knowledge may not be relevant since the characteristics of natural images are distinct from those of medical images. In this work, we propose to resolve the aforementioned issues by transferring knowledge from a CXR modality-specific pretrained model to improve performance in a relevant CXR classification task. We trained the models using existing loss functions and also proposed several loss functions. Our experimental results showed that the model trained to minimize the calibrated CCE loss demonstrated superior values for all metrics. It is followed by the models trained to minimize the CCE with entropy-based regularization, calibrated negative entropy, label-smoothed categorical focal, and calibrated categorical Hinge losses, most of which are proposed in this study. We evaluated the performance of both prediction-level and model-level ensembles. We observed from the experiments that the model-level ensembles demonstrated markedly better performance than the prediction-level ensembles. We further improved performance by (i) deriving optimal weights using the SLSQP method, and (ii) using the derived weights to perform weighted averaging of the predictions of the top-3 and top-5 model-level ensembles. We observed that the weighted averaging ensemble demonstrated superior performance for all metrics compared to the individual models, their ensembles, and the SOTA literature. Finally, we used Grad-CAM-based visualization tools to interpret the learned weights in the individual models and model-level ensembles. We observed that these models precisely localized the ROIs showing disease manifestations, conforming to the expert's knowledge of the problem.
Our study combined the benefits of (i) performing CXR modality-specific knowledge transfer, (ii) proposing loss functions that delivered superior classification performance in a multiclass classification setting, (iii) constructing prediction-level and model-level ensembles to achieve SOTA performance as shown in Table 5. However, there are a few limitations to this study. For example, novel loss functions could be proposed for classification tasks to train models and their ensembles. Other ensemble methods such as blending and snapshot ensembles could also be attempted to improve performance. It is becoming increasingly viable to deploy ensemble models in real-time for image and video analysis with the advent of low-cost